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ABSTRACT 

We present and validate BlastR, a method for effi- 
ciently and accurately searching non-coding RNAs. 
Our approach relies on the comparison of di-nucleo- 
tides using BlosumR, a new log-odd substitution 
matrix. In order to use BlosumR for comparison, 
we recoded RNA sequences into protein-like se- 
quences. We then showed that BlosumR can be 
used along with the BlastP algorithm in order to 
search non-coding RNA sequences. Using Rfam as 
a gold standard, we benchmarked this approach 
and show BlastR to be more sensitive than BlastN. 
We also show that BlastR is both faster and more 
sensitive than BlastP used with a single nucleotide 
log-odd substitution matrix. BlastR, when used in 
combination with WU-BlastP, is about 5% more 
accurate than WU-BlastN and about 50 times 
slower. The approach shown here is equally effect- 
ive when combined with the NCBI-Blast package. 
The software is an open source freeware available 
from www.tcoffee.org/blastr.html. 

INTRODUCTION 

We describe in this work, a strategy to efficiently and ac- 
curately search databases for homologous non-coding 
RNAs (ncRNAs). This problem is growing in import- 
ance, mostly because so many new classes of ncRNAs 
have been recently reported (1^), revealing ncRNA in- 
volvement at virtually all cellular levels, especially gene 



regulation. In broad terms, ncRNAs can be divided into 
two classes: those that are active via a well-defined and 
evolutionary constrained secondary and tertiary structure, 
and those for which no evidence is yet available to suggest 
a hnk between function and structure. The Rfam data- 
base is mostly made of RNAs with a well-defined and 
conserved secondary structure (5). The second group 
includes many of the recently discovered ncRNAs 
involved in gene regulation, such as Piwi-interacting 
RNAs (1) involved in gene silencing, hncRNAs (3) a 
novel class of regulated ncRNAs, epigenetic regulators- 
Hke HOTAIR (HOX antisense intergenic RNA) (6) or 
nuclear trafficking regulators-like NRON (ncRNA 
repressor of NFAT) (7). 

Further characterizing these new classes of genes has 
become a prime target for biology, resulting in the devel- 
opment of an increasing number of in-silico methods 
for the prediction and comparison of RNA sequences. 
Analyzing ncRNAs is, however, a complex task. First of 
all, the genes are hard to discover, and although some of 
them, such as hncRNAs, exhibit the same properties as 
protein-coding genes, with introns/exon structures, similar 
chromatin markings (3) and capped poly-adenylated tran- 
scripts (8), they lack the equivalent of an open reading 
frame that could ease their detection. Homology-based 
analysis is not much easier since RNA sequences tend to 
evolve rapidly, either under the sole constraint of main- 
taining stable secondary structures (in the case of struct- 
ured RNAs), or under unknown functional constraints 
whose effect on sequence variation is hard to anticipate. 
As a consequence, their level of sequence conservation is 
limited and often too weak to yield statistically meaningful 
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alignments, even when considering moderate evolutionary 
distances such as mouse versus human. Using standard 
database search tools like Blast is, therefore, difficult 
and rarely as informative as in the case of protein-coding 
genes. 

More sensitive searches are possible, using profiles for 
instance, but the highest level of sensitivity is only achieved 
when using algorithmic methods simultaneously taking 
into account sequence similarity and predicted (or experi- 
mentally known) secondary structures hke the Sankoff 
algorithm does (9). These methods are computationally 
very expensive [Sankoff has a time complexity of 0(L^ ) 
where N is the number of sequences and L their length] 
and their practical usage requires the design of heuristic 
versions with lower complexity. These include Genetic 
Algorithms (10) or banded Sankoff implementations Hke 
Consan (11). In practice, however, the most popular ap- 
plications to perform efficient database searches make use 
of stochastic context free grammars (SCFG) (12). These 
include Rsearch (13) and RaveNna (14) that are based on 
Infernal (15). Although their algorithmic details vary, the 
overall principle is similar: informing the database search 
with some secondary structure information. 

A special form of SCFGs called covariance models 
(CMs) is widely used today in RNA homology search 
(16). Conceptually, CMs are similar to protein profiles, 
albeit computationally much more expensive to use. CM 
parameters are trained on high-quality (seed) alignments. 
In the Rfam framework (5), for example, seed alignments 
are taken from literature and are often manually refined to 
ensure high quahty [e.g. using editors like RALEE (RNA 
ahgnment editor in Emacs) (17)]. These alignments are 
used to compute CMs, which are in turn used by 
Infernal to search databases. The whole setup is similar 
to HMM (Hidden Markov Model) searches using 
HMMER for Pfam (18,19) with the seed ahgnment 
quahty being a critical parameter (20). Infernal searches 
are much more computationally demanding than 
HMMER-based searches, even when considering recent 
improvements (21). To reduce the search space a 
pre-filtering step is often necessary (22). This can be 
done by using tools especially developed for this task, 
e.g. (14) or (23) or by means of a Blast (24) search with 
'relaxed' parameters (to avoid filtering out true positives). 
Several studies describe the optimization of BlastN par- 
ameters for searching RNA sequences (25,26). The 
method we describe here is precisely addressing this 
pre-filtering step and shall be seen as an attempt to im- 
prove database searches at a reasonable extra computa- 
tional cost over a simple BlastN and without the need of 
profiles, SCFG or accurate secondary stmcture information. 

Our approach takes advantage of the possibility of im- 
proving database search procedures by taking into account 
di-nucleotide content. This idea is not new and neighbor- 
ing nucleotides have been known for a long time to exhibit 
some dependencies reflected in their evolutionary patterns 
(27-30). Durbin et al. (31) even proposed a Markov chain 
model able to take this phenomenon into account. While 
the reahty of this phenomenon is relatively well-accepted 
at the genomic level and known to influence the stability 
of secondary structures at the transcript level (32,33), 



it remains debated whether the quantification of this 
effect can help improving gene finding methods (34,35). 
Di-nucleotides can also be used to improve multiple 
sequence ahgnments, as recently reported by Lu and Sze 
(36). Here we go further and show that in order to use 
di-nucleotide information, RNA sequences can be recoded 
using a 16 letters alphabet, thus making it possible to 
apply protein algorithms [Blast?, BlastClust (24)] to 
RNA sequences. This result is obtained by computing a 
suitable RNA substitution matrix (BlosumR) and using 
Rfam as a validation framework. While similar RNA- 
specific matrices had been previously estimated on riboso- 
mal RNA (13,37), this work is, to our knowledge, the first 
reported attempt to use these models for efficient database 
searches in a Blast-hke framework. 



METHODS 

Computation of the BlosumR matrices 

In biology, the cost for aligning two symbols is often 
estimated using the log ratio between the observed fre- 
quency of substitutions and the expected ones, as measured 
on a collection of reference alignments (pair wise or 
multiple). This approach forms the basis of the PAM 
(point accepted mutation) (38) and BLOSUM (blocks of 
amino acid substitution matrix) matrices (39). We used a 
related approach in this work, to estimate a matrix that 
takes into account the cost for aligning two nucleotides to 
one another while taking into account the nature of their 
immediate neighbor. The cost for aligning two nucleotides 
A^v with A^,. while taking into account Ny-i and iV.v-/ 
(their immediate neighbors) can therefore be expressed 
as follows: 

Scove{N^N^^i,NyNy^i) = log -^^ — -^—^ | 



where /](7V;vJVv|^a—i,jV,_i) is the frequency of an ahgnment 
column N_^^Ny given a neighboring column that contains 
N_y-i, Ny_j. The log-odd ratio in Equation (1) therefore 
reflects the influence of di-nucleotide content on the 
observed substitutions. In practice, this amounts to 'en- 
riching' the substitution cost of each position with some 
information related to di-nucleotide content. 

The counts used to compute our matrix were estimated 
on a data set containing 792 RNA families reported in 
Rfam release 9.1. These families were used because they 
are distinct from the ones used for validation (see next sec- 
tion). We named this matrix BlosumR (Figure IB) because 
it relies on a procedure similar to that described for the 
BLOSUM matrix. For BlosumR, all pairwise projections 
yielding between 62% and 80% identity were extracted 
from the Rfam seed ahgnments. AU positions containing 
gaps in the pairwise projection were excluded. The final 
matrix is scaled in half bits and contains entries reflecting 
the cost for the substitution of any pair of di-nucleotides. 
For vahdation purposes and in order to show the useful- 
ness of di-nucleotide information, we also re-computed 
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A B C 




CAAGCUC AACAGAC CAAAUC ACAGGUCUU 
RNA CAAGCUC AACAGAC CGCAGUGCAGUGCUU 
CAAGCUCAACAGACCGGAAUUAGGCGUUU 




di-nucleotides log odd matrix 



i 

^ BlosumR matrix 




Query: 9 ALWDNGSGMCKAGGGSIQQDAPRAVFPSIVGRPRHTYEVMVGMGQKDSLVGDE--SKRG 66 

ALVVDNGSGMCKAG DAPRAVFPSIVGRPRH VMVGMGQKDS VGDE SKRG 

Sbjct: 7 ALWDNGSGMCKAGFAG— DDAPRAVFPSIVGRPRHQ-GVMVGMGQKDSYVGDEAQSKRG 63 



Figure 1. (A) Receding scheme. The table shown represents the di-nucleotide to amino acid symbol receding scheme. Rows correspond to the first 
letter of the di-nucleotide, while columns to the second. For example the di-nucleotide 'AC is receded to 'E'. (B) BlosumR leg-odd inatrix. The 
matrix assigns a score for each possible substitution between di-nucleotides pairs. (C) BlesumN log-odd matrix. The matrix assigns a score for each 
possible substitution between nucleotides pairs. (D) BlastR pipeline. Rfam seed alignments are used to estimate a log-odd matrix (BlosumR). BlastP 
is then used along with BlosumR to search receded RNA query sequences against a similarly receded RNA database. 



similar matrices from reference alignments with randomly 
reordered columns (i.e. reference alignments with random- 
ized di-nucleotide content). The table used for the exten- 
sion and the BlosumR matrix are displayed in Figure lA 
and B. 

An interesting property of this formulation is its 
rehance on di-nucleotides. Since there are only 16 
di-nucleotides, one can easily recode them using a 16 
letters alphabet. Given a matrix hke the one defined in 
Equation 1, it is therefore possible to recode the RNA 
sequences into pseudo-amino acid sequences and to 
align them with the BlosumR matrix used as a protein-like 
substitution matrix. This recoding involves replacing each 
nucleotide with a symbol coming from an extended 
16-letters alphabet (Figure lA and B). The choice of the 
5' or 3' neighbor is arbitrary since it does not have any 
consequence on the amount and the quality of 



information eventually concentrated on every symbol 
after extension. The recoding is carried out by sliding a 
window of size 2 and replacing every nucleotide with one 
of the 16 letter alphabet symbol. Given an RNA sequence 
of size L, the recoding therefore results in a recoded 
sequence of size L-1. A neutral character is then added 
on the last position to preserve the original sequence 
length. We name such sequences recoded RNAs 
(reRNAs). Once recoded, the sequences can be fed to 
BlastP and searched against a database of extended se- 
quences, as outhned in Figure ID. 

Computation of the BlosumN matrices 

The single nucleotide matrix BlosumN (Figure IC) was 
estimated on the same Rfam reference multiple sequence 
alignments, using the same interval of 62-80% for the 
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selection of pairwise projections. The following formula 
was used to estimate the matrix: 

Score iV,, TV,) = log '/ ^ 

^ *f(Nr)) (2) 



Where f(N^Ny) is the frequency of an alignment column 
Validation data set 

In order to estabhsh the relative merits of the di- 
nucleotide-based searches with respect to simpler methods 
such as BlastN, one requires a reference data set that can 
be used to measure sensitivity and specificity. For this 
purpose we used Rfam which is a collection of RNA ahgn- 
ments. For each family a high-quahty seed alignment and 
a corresponding full ahgnment are provided. Full ahgn- 
ments are created by searching databases for homologues 
of the seed sequences using Infernal. It is appropriate to 
use Rfam as a reference since our goal is not to over- 
perform Infernal in accuracy, but rather to show how 
di-nucleotides based searches can produce results close 
enough to Infernal so that they might be used for efficient 
and sensitive pre-filtering. Vahdation was done on a subset 
of Rfam that we named Rfam-G because it contains all 
the sequences of the 603 families of Rfam 9.0 that can be 
mapped onto a reference genome. Rfam-G contains a 
total of 10409 sequences (591 famihes). 

It is important to point out that the Rfam-G famihes 
are distinct from the 792 Rfam families used for estimating 
the BlosumR and BlosumN matrices. Although the 
genomic location is not used here, the rationale behind 
the development of Rfam-G is to create a framework in 
which both homology search and homology-based RNA 
gene discovery methods can be evaluated with a compar- 
able data set (manuscript in preparation). 

True positive/false positive curves 

One can validate database search procedures by producing 
receiver operator curves (ROC) or a similar representation 
that involves plotting the number of false positives (FPs) 
versus true positives (TPs) while varying some meta- 
parameter affecting the trade-off between specificity and 
sensitivity (typically a threshold). When doing so, the 
method's capacity to separate TPs and FPs is globally 
estimated without any dependence on a meta-parameter 
(threshold). The relative performances of two methods can 
then be compared by measuring the area under the curve 
(AUC) or by comparing the number of identified TPs for a 
given number of accepted FPs. 

In order to produce these graphs, Rfam-G families were 
combined. For each family, we selected as a representative 
sequence the most closely related member (as estimated by 
measuring the average pairwise percent identity on the 
corresponding Rfam-G seed alignment). Each representa- 
tive was then searched against the fuU Rfam-G database 
and in the resulting output, family members of that same 
representative sequence were labeled as Proven Positives 



while non-members were labeled as Proven Negatives. 
This search/labeUng procedure was carried out with the 
representative sequences of aU Rfam-G famihes, resulting 
in 591 distinct Blast outputs (one per family). These 
outputs were concatenated and sorted by is-values. FPs 
versus TPs graphs were produced by scanning the sorted 
output hne by hne, and counting for each Hne the number 
of TPs and the number of FPs (respectively, number of 
Proven Positives and Proven Negatives having a score 
better or equal to that of the considered line). This 
strategy is fairly standard and similar to the one recently 
used by Biegert and Soding (40). As pointed out by these 
authors, large families (Ribosomal RNA or RNAse-P in 
our case) tend to dominate these FP/TP curves, a 
drawback that can be corrected by weighting the contri- 
bution of each sequence with the inverse of its family size 
(i.e. down-weighting the contribution of big families). We 
used this approach to estimate the quantities referred to as 
weighted TPs and weighted FPs in Figure 2 and 3. 

Validation of the clustering capacity 

A methodology was designed in order to estimate the clus- 
tering capacities of BlastR. This methodology estimates 
the capacity of a clustering method (BlastClust, a compo- 
nent of the NCBI-Blast package) to reconstruct a known 
clustering (Rfam-G) when using a comparison method to 
estimate pair-wise sequence similarity. The clustering was 
applied onto a data set made of aU Rfam-G sequences. 
BlastClust clusters were then evaluated using the following 
strategy: given the N Rfam-G sequences, an N x N matrix 
is defined. In the reference matrix (the one built on the 
original Rfam-G families), a vertex is set to 1 and con- 
sidered a Proven Positive whenever the two corresponding 
sequences belong to the same family, the other vertices are 
set to 0 (Proven Negative). Given the output of BlastClust, 
a similar matrix is built, where new-found clusters are used 
to mark the vertices. The two matrices are then compared. 
TPs can be estimated by counting the number of vertices 
set to 1 in the reference matrix and to 1 in the new matrix; 
similarly for FPs (0 and 1), False Negatives (1 and 0) 
and True Negatives (0 and 0). Counts can be weighted 
as described above and used to estimate sensitivity and 
specificity using standard formulas: 

Sn = TP/(TP+FN) (3) 

Sp = TN/(TN+FP) (4) 

We used this method to benchmark BlastClust (Figure 5), 
varying the inclusion threshold in order to obtain a range 
of values for Sn and Sp. These values were used to plot a 
graph Sn versus (1 — Sp). The areas under the resulting 
curves were integrated. By default, BlastClust uses 
MegaBlast (41) as a comparison engine for nucleotide se- 
quences and BlastP for proteins. In the context of this 
work, we only used BlastP as a comparison engine since 
MegaBlast is meant to be used for closely related se- 
quences and is therefore not sensitive enough for the 
kind of clustering considered here. 
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(Weighted) true positives 

Figure 2. Receiver-operator-like curve (ROC) for WU-Blast. The hori- 
zontal axis indicates the weighted number of reported TPs, while the 
vertical indicates the corresponding weighted number of FPs. Going 
from left to right on the top of the graph, the blue curve shows 
WU-BlastN, the orange curve corresponds to WU-BlastP/BlosumN 
and the green one shows WU-BlastR. 




(Weighted) true positives 

Figure 3. ROC for optimized Blast. Similar layout as Figure 2. Going 
from left to right on the top of the graph, the blue curve corresponds to 
default WU-BlastN, the pink curve shows NCBI-BlastN optimized by 
Roshan et al. (26), the brown curve corresponds to WU-BlastN 
optimized by Freyhult et al. (25) and the green one shows WU-BlastR. 



Database search algorithms 

In order to validate the relevance of the sequence recoding 
and the use of BlosumR matrices, we used the BlastP 
algorithm to search reRNAs against their corresponding 
reRNA databases. We named this approach BlastR. For 



Table 1. Blast flavors used in the validation and associated 
parameters 



Name 


GOP 


GEP 


Word 
Size 


Matrix 


Sequence 
tType 


NCBI-BlastN 


5 


2 


11 


+1/-3 


Nuc 


NCBI-BlastN Optimized 


8 


6 


4 


+ 5/-4 


Nuc 


NCBI-BlastP/BlosumN 


5 


2 


3 


BlosumN 


Nuc 


NCBI-BlastR 


5 


2 


3 


BlosumR 


Recoded 












Nuc 


WU-BlastN 


10 


10 


11 


+ 5/-4 


Nuc 


WU-BlastN Optimized 


20 


10 


7 


+5/-4 


Nuc 


WU-BlastP/BlosumN 


10 


10 


3 


BlosumN 


Nuc 


WU-BlastR 


10 


10 


3 


BlosumR 


Recoded 



Nuc 



'Name' indicates the Blast flavor name, 'GOP' is the gap opening 
penalty, 'GEP' the gap extension penalty. 'Word size' indicates the 
word size for seeding alignment. 'Matrix' indicates the scoring scheme 
that has been used. 'Sequence type' indicates whether the considered 
sequences were nucleotides or recoded nucleotide sequences (respective- 
ly 'Nuc' and 'Recoded Nuc'). BlastP can be used with any Blosum like 
scoring scheme (see text). 



comparison purposes, BlastR was designed to run both 
with WU-BlastP (version 2.0 MP, Gish, unpublished) 
and with NCBI-BlastP (standalone NCBI Package 
version 2.2.18). BlastR was also compared against 
several different flavors of BlastN and BlastP packages 
(Table 1). Considering the nucleic nature of the reRNA 
sequences, we used for BlastP the BlastN default gap 
penalties while keeping all other parameters to their 
default BlastP values. In order to further einulate BlastN 
behavior, sequences were reverse complemented before ex- 
tension (i.e. the recoded databases contain both the direct 
and the reverse sequence). 

The substitution matrices used were the default NCBI 
and WU-Blast matrices (+1/— 3 or +5/— 4, respectively) 
when searching RNA sequences with BlastN, and the 
BlosumN or BlosumR matrices when using BlastP. The 
analyses presented in the main body of the text were all 
carried out using WU-Blast, although similar results were 
obtained using NCBI-Blast (Figure SI in Supplementary 
Data). WU-Blast is now distributed under the name 
AB-Blast but the behavior of the AB-Blast package was 
identical to WU-Blast in our benchmarks. 

Implementation/distribution 

BlastR is a Perl wrapper for BlastP. Three wrappers are 
distributed, one for AB-Blast (Gish, unpublished data), 
one for WU-Blast, and one for NCBI-Blast. The 
package has been designed to behave like these three im- 
plementations of Blast, supporting exactly the same 
command hne. BlastR is part of the standard T-Coffee 
distribution, an open-source freeware available from 
www. tcoffee . org/blastr .htinl . 

RESULTS 

In this work, we describe a procedure for searching 
ncRNAs in databases. Our approach rehes on three 
distinct components: the use of a 16 letters alphabet 
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reflecting di-nucleotide content (recoded sequences), the 
use of the BlastP algorithm to ahgn the nucleotide se- 
quences, and the use of new substitution matrices 
trained on Rfam seed alignments (BlosumR and 
BlosumN). Our first task was to determine the net contri- 
bution of each component when doing database searches. 

We started by estimating the contribution of the BlastP 
algorithm. It is important to reahze that BlastP and 
BlastN do not use the same algorithm. BlastP takes ad- 
vantage of k-tuple similarity, while BlastN rehes on perfect 
k-tuple identity. Furthermore, the two algorithms rely on 
a different low-level parameterization (different word size, 
gap opening and gap extension penalties, HSP (high- 
scoring segment pairs) extension thresholds, two-hit/ 
single-hit algorithms, multiple-hit window size, drop-off 
value for gapped ahgnment, drop-off value for 
un-gapped extensions) and the BlastP algorithm 
supports the use of user defined log-odd matrices. It 
was, therefore, necessary to quantify how these variations 
may affect subsequent analysis. We did so by searching 
nucleotide sequences with BlastP while treating them as 
protein sequences made of a four letters alphabet and 
using as a substitution matrix BlosumN (BlastP/ 
BlosumN), a nucleotide log-odd substitution matrix esti- 
mated on the Rfam seed alignments. We estimated the 
behavior of WU-BlastP/BlosumN by running one 
selected query from each Rfam family, concatenating the 
resulting outputs and sorting them by iJ-value. The result- 
ing hst was used to produce the FP versus TP plots in 
Figure 2 ('Material and Methods' section). As a reference 
we also included WU-BlastN with default parameters. 

The results clearly show that using WU-BlastP/ 
BlosumN as a substitution matrix yields an improvement 
over BlastN. It is hard to determine whether the improve- 
ment is an effect of the matrix, the algorithm or a com- 
bination, since neither WU-BlastN nor NCBI-BlastN 
support the use of ad hoc matrices. We further quantified 
the difference in behavior between BlastN and BlastP by 
precisely estimating the number of TPs returned while ac- 
cepting a specified number of FPs. Results (Table 2), 
suggest that when accepting up to 10 FPs, WU-BlastP/ 
BlosumN reports 535 TPs, that is to say about >4% of 
WU-BlastN. We did not see a similar improvement when 



Table 2. Comparison between various Blast configurations 



Blast Flavor 


Mat 


#TP10 


#TP20 


#HSPs 


CPU (s) 


WU-BlastN 


+5/-4 


514 


538 




3 


WU-BlastP/BlosumN 


BlosumN 


535 


553 




440 


WU-BlastN Opt 


+5/-4 


533 


551 




6 


WU-BlastR 


BlosumR 


541 


557 




169 


NCBI-BlastN 


+ 1/-3 


523 


530 


51565 


6 


NCBI-BlastP/BlosumN 


BlosumN 


492 


521 


553738 


156 


NCBI-BlastN Opt 


+ 1/-3 


499 


525 


149065 


103 


NCBI-BlastR 


BlosumR 


509 


539 


171891 


89 



'Blast Flavor' indicates which Blast was used, 'Mat' indicates which 
matrix, '#TP10' and '#TP20' indicate how many weighted TPs were 
reported for a total of, respectively, 10 and 20 accepted weighted 
FPs. '#HSPs' indicates how many HSPs were reported (only for 
NCBI). 'CPU' indicates the CPU time in seconds. 



feeding NCBI-BlastP the BlosumN matrix (Figure SI in 
Supplementary Data). In WU-Blast, the increased sensi- 
tivity comes at a significant computational cost and we 
show in Table 2 that on the same benchmark data set, 
WU-BlastP takes about 150 times longer to produce a 
result than WU-BlastN. 

We then replaced BlosumN with BlosumR, the 
di-nucleotide matrix and ran the benchmark on recoded 
RNA sequences. This flavor of Blast is named BlastR and 
its benchmark behavior is shown on Figures 2 and 3. The 
improvement of WU-BlastR over WU-BlastP is modest 
(557 TPs versus 553 when accepting 20 FPs) but more 
significant when considering the NCBI package (539 
versus 521). Yet in both cases, this improvement comes 
along with a significant increase in efficiency. WU-BlastR 
requires nearly 60% less CPU (Central Processing Unit) 
time than WU-BlastP/BlosumN (Table 2) while 
NCBI-BlastR requires about 40% less CPU time than 
NCBl-BlastP/BlosumN. The lower CPU requirement is 
consistent with a significant lower number of reported 
HSPs (at least in NCBI BlastP) as shown on Table 2. 
Albeit modest, the improvement remains consistent over 
the whole TP versus FP graph. We finally compared 
BlastR with two optimized versions of BlastN, one from 
Freyhult et al. (25) and one from Roshan et al. (26). The 
results on Figure 3 and Table 2 suggest a moderate but 
consistent improvement of BlastR over these optimized 
flavors of BlastN. 

In order to further investigate the source of the im- 
provement obtained when replacing BlosumN with 
BlosumR, we re-estimated BlosumR matrices on Rfam 
seed alignments with shuffled columns. The ahgnments 
thus produced retain the same level of identity, the same 
single nucleotide mutation patterns, but lose their 
di-nucleotide dependencies. We produced 1000 such rep- 
licates based on the 792 Rfam family set on which the 
BlosumR matrix was estimated, and we used these 1000 
replicates to search Rfani-G with BlastR. The results 
(Figure 4) show the distribution of reported TPs (for 10 
accepted FPs) obtained with these replicates. Interestingly, 
the behavior of BlastR when used with shuffled BlosumR 
matrices is simflar to the behavior of BlastP/BlosumN. 
This observation suggests that the differences observed 
on Figure 2 results from the use of di-nucleotide informa- 
tion. This result measured on a large number of replicates 
confirms the existence of a modest but real signal resulting 
from di-nucleotide dependencies and susceptible to 
improve database searches. It is possible to estimate 
these dependencies by measuring the log-odd ratio 
between the observed and expected frequency of every 
di-nucleotide given the background frequencies of the 
four nucleotides. Results (Figure S2 in Supplementary 
Data) confirm that the strongest dependency is between 
a nucleotide and its immediate neighbor. The second 
peak of dependency occurs when considering the 
dependency between a given nucleotide and its +4 
neighbor; this distal dependency is, however, significantly 
lower. 

One of the main objectives of BlastR is to allow the 
identification of new ncRNA famihes. In order to estimate 
the relative merits of BlastR for that purpose, we designed 
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Figure 4. Score distribution with shuffled matrices. Thousand shuffled matrices were used to produce the corresponding number of replicate 
benchmarks. For each replicate, the weighted number of TPs for a weighted total of 10 accepted FPs was measured and used as a score for the 
replicate. The graph shows the distribution of these scores, with the vertical line on the right indicating the value associated with BlastR. 
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Figure 5. ROC analysis of clustering capacities. Horizontal axis: 
1 -specificity. Vertical axis: sensitivity. Each curve shows the trade-off 
between these two quantities when varying the inclusion threshold of 
BlastClust. Starting from the bottom right corner to the top left, the 
bottom curve corresponds to NCBI-BlastP/BlosumN, while the top 
curve corresponds to NCBI-BlastR. 



a simple benchmark test. It involves re-clustering Rfam-G 
ab initio and using an estimate of the similarity between 
the new clusters and the original one as a measure of 
success. Results are displayed on Figure 5. This test was 
used to compare BlastR and BlastP/BlosumN. The results 
are in agreement with those reported on Figure 2 and 
suggest that BlastR consistently provides a better 
trade-off between sensitivity and specificity. 



It is important to point out that with the exception of its 
substitution matrix BlastR has not been tuned in any way 
for RNA. It is inerely the off-the-shelf BlastP package 
used along with a di-nucleotide substitution matrix and 
a BlastN gap penalty scheme. We therefore did not 
make any specific attempt to properly estimate the param- 
eters of the Extreme Value Distribution (K and A) used to 
model the ii-value estimations. The problem of addressing 
the effect of unsuitable values for K and k is not new (40), 
and it is well-known that inaccurate values result in under 
or over estimated ii-values for the reported hits. To assess 
the effect on BlastR we systematically compared the 
ii-values reported by BlastN with the Values reported 
by BlastR when considering the same hit (Figure 6A and 
B). The correlation between the log (ii-value) is linear and 
on that same scale BlastR values are roughly twice lower 
than BlastN. Interestingly, the same hnear correlation and 
the same doubhng effect occur when comparing BlastN 
ii-values with BlastP/BlosumN ii-values (Figure S3 in 
Supplementary Data). This observation rules out our 
di-nucleotide matrix (BlosumR) as the main explanation 
for the doubling effect. It is, therefore, most hkely that in 
BlastR, the ii-value underestimation results from inappro- 
priate values of K and A in BlastP. Fortunately, the con- 
sequences on the present analysis are limited. The log-shift 
does not affect hit ranking, and therefore does not chal- 
lenge any of the results reported here. In practice, to 
benefit from the increased sensitivity of BlastR (or 
BlastP/BlosumN), users simply need to use an ii- Value 
of 10~^ as a threshold for significance, rather than the 
value of 10""^ that appear to be most suitable for BlastN 
on this data set. Aside froin this is-value shift the results 
on Figure 6 confirm the capacity of BlastR to discriminate 
very accurately between proven positives (blue dots) and 
proven negatives (red dots). 
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Figure 6. (A) £-values correlation between WU-BlastN and WU-BlastR. The horizontal axis indicates the ten based logarithm of the ii-values 
recorded on WU-BlastN output. The vertical axis corresponds to that same quantity measured on WU-BlastR outputs. Each dot on the graph 
corresponds to the same hit observed on two outputs. Blue dots are hits labeled as proven positives, red dots correspond to proven negatives. (B) 
Close up of the previous graph centered on the high £-value area. 



DISCUSSION 

In this work, we describe and validate a database searcli 
metliod specific for non-coding RNA sequences. We show 
that through simple sequence recoding and estimation of 
an ad hoc substitution model, one can improve the 
accuracy of standard tools such as Blast. Sequence 
recoding is achieved by switching from the nucleotide 
four-letters alphabet to a protein-hke 16-letters alphabet. 
Because they are recoded using an amino acid-Hke alpha- 
bet, the resulting sequences can be dealt with using protein 
analysis tools. 

We examined the effect of using the BlastP algorithm to 
analyze recoded RNA sequences. This recoding amounts 
to increasing the information density within sequences by 
overloading each position with information related to its 
3' neighbor. While in theory, a nucleotide could be 
informed with the nature of any other nucleotide in its 
vicinity (i.e. one, two, three or more nucleotides apart), 
we showed here that the strongest dependency occurs 
between immediate neighbors. Interestingly, we also 
found an increased signal at position +4 (Figure S2 in 
Supplementary Data). This means that if one was ready 
to use a slightly larger alphabet, it might be worthwhile 
exploring the effect of combining these three nucleotides 
(NNx xN) into a unique symbol. As shown in Figures 2 
and 3, most of the improvement reported here is achieved 
by replacing the standard identity matrix in BlastN with a 
log-odd matrix estimated on Rfam RNA alignments 
(BlosumN) and feeding this matrix to the BlastP algo- 
rithm. We then show how a more discriminative matrix 
can be estimated by considering di-nucleotide substitu- 
tions to search nucleotide sequences recoded accordingly. 
We name the resulting procedure BlastR and show how its 



increased database search performances also result in 
improved clustering capacities when evaluating the 
re-clustering of Rfam-G. The modifications we propose 
here are simple and amount to a database pre-processing 
procedure (reverse complementation, sequence extension) 
combined with the use of di-nucleotide-specific substitu- 
tion matrix (BlosumR). As such, they can be added on the 
top of any Blast-like package (PSl-Blast, Smith and 
Waterman, HMMER, etc.). We are currently distributing 
three such adaptations: one for the NCBI-Blast, one for 
WU-Blast and one for AB-Blast. We report comparable 
results when using these three packages. 

An important effect of the recoding is to create se- 
quences with a lateral dependency. In the recoded se- 
quences, some symbol combinations are not allowed 
anymore because they are incompatible with the 
underlying recoding. For instance the symbol 'D' that cor- 
responds to the 'AA' di-nucleotide cannot be followed by 
the symbol T' that correspond to 'CA'. In order to be 
compatible, two extended symbols must have identical nu- 
cleotides on their 5' and 3'. This property results from the 
fact that the di-nucleotides do not define an arbitrary 
reading frame, but merely correspond to a sliding window 
of size 2. Such a high lateral dependency would dramatic- 
ally decrease the estimated ii-values if Blast were consider- 
ing dependencies between adjacent positions. Blast, 
however, like Smith and Waterman or the most common 
string matching algorithms, ignores such dependencies. It 
rehes on the simplifying assumption that all positions are 
independent from one another and sums up the contribu- 
tion of each position to estimate the final score. As a con- 
sequence, the privileged association between two symbols 
does not affect the score. 
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Overall, the observation that taking into account di- 
nucleotide composition when searching databases results 
in an improvement should not come as a surprise. 
Nucleotides forming dinucleotides have been known for 
a long time not to be fully independent. The question is, 
therefore, not whether the di-nucleotide signal exists, but 
whether it is strong enough to make a difference when 
added on the top of any analysis. In this work, we 
designed a simple protocol to enrich standard database 
search methods with di-nucleotide signal. We show here 
that the effect is strong enough to influence sensitivity. 
Although our results suggest that the best combination 
between CPU requirement and sensitivity is obtained 
when using recoded sequences, it is worth noting that we 
also report a significant increase in accuracy by the mere 
fact of using a nucleotide log-odd matrix (BlastP/ 
BlosumN). Of course, one would have liked to test these 
nucleotide matrices within BlastN. Unfortunately, this is 
not possible as the currently available BlastN implemen- 
tations have not been designed to use ad hoc scoring 
schemes. 

Considering the importance of RNA analysis and the 
exponential growth of this research field, any small im- 
provement in detection capacities could have major con- 
sequences. It must be stressed here that BlastR is not an 
attempt to improve over structure-aware, CPU intensive 
methods, hke Infernal. These packages use RNA informa- 
tion in a way that is much closer to optimahty. Our goal is 
not to outperform them but rather to ask how close one 
can get to their accuracy while using CPU inexpensive 
methods such as Blast. For this reason, we used the 
output of these methods (namely Infernal/Rfam) as a 
standard of truth and asked what fraction of the signal 
generated by accurate methods can be recovered using 
BlastR. A niche exists for methods like BlastR since 
today even the most sophisticated procedures designed 
for the assembly of ncRNA databases like Rfam rely in 
their early stage on BlastN-based analyses. Whenever 
BlastN is used, BlastR could be used as a drop-in replace- 
ment. BlastR has roughly the same CPU requirements as 
BlastP. That makes it about 10-100 times slower than 
BlastN but 2-5 orders of magnitude faster than any 
other non-Blast based alternatives such as Infernal (21). 

In this analysis, we have focused our interest on 
evaluating how BlastR can be used to identify honiologs 
among a set of well-defined RNA sequences. This exercise 
supposes the existence of data sets made of mature RNA 
transcripts (as opposed to genes embedded within a 
genome). This situation is reahstic, given the large 
number of RNAseq experiments being performed in an 
increasing number of projects that deliver fully assembled 
transcriptomes. Of course BlastR could also be used to 
scan genomes for embedded multiexonic genes but the 
performance of our method in this precise context remains 
to be evaluated. This work is currently underway, taking 
advantage of the Rfam-G data set described here. Last but 
not least, it is interesting to note that the use of protein- 
hke sequences makes it possible to use all available flavors 
of Blast including the most sophisticated, such as PSI- 
Blast. Unfortunately, our attempts at doing so have 
been hampered by the presence of many hard coded 



protein-dependent parameters within PSI-Blast. The de- 
velopment of PSI-BlastR would therefore require a more 
invasive software engineering approach, but it is certainly 
an avenue worth exploring. 
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